Learning Objectives

By the end of this practical lab you will be able to:

  1. Understand how R can be used for web analysis
  2. Utilize web mapping servers to provide contextual detail for maps within R
  3. Use geocoding and routing functionality to plot locations

Introduction

This practical lab concerns integrating the web with R, both as a source of data and as an analytics platform. These connections utilize Application Programming Interfaces (APIs), which enable data queries or analytics to be run and the results returned within R. Much of the complexity of these interfaces is hidden by the R packages demonstrated here, which makes them quite accessible.

Reading data from the web

One of the simplest ways in which you can read data from the web is by using some of the same functionality for reading local files. For example, we can read a CSV file of municipal swimming pool locations in Brisbane, Australia as follows:

# The local file location is swapped for a remote URL
swimming_pools <- read.csv("https://www.data.brisbane.qld.gov.au/data/dataset/ccf67d3e-cfaf-4d30-8b78-a794c783af9f/resource/c09546c8-9526-4358-a1eb-81dbb224cdca/download/Pools-location-and-information-09Dec16.csv")

#Show the top six rows
head(swimming_pools)
##                           Name                                    Address
## 1  Acacia Ridge Leisure Centre         1391 Beaudesert Road, Acacia Ridge
## 2              Bellbowrie Pool                 47 Birkin Road, Bellbowrie
## 3      Carole Park Swim Centre Cnr Boundary Road and Waterford Road Wacol
## 4 Centenary Pool (Spring Hill)           400 Gregory Terrace, Spring Hill
## 5               Chermside Pool               375 Hamilton Road, Chermside
## 6  Colmslie Pool (Morningside)               400 Lytton Road, Morningside
##       Phone_No Phone_No_2
## 1    3277 8686           
## 2    3202 6620           
## 3 1300 332 583  3271 6116
## 4 1300 332 583           
## 5 1300 252 583           
## 6 1300 733 053           
##                                                                                                                                                                                                                                                                                                                                                          Opening_Hours
## 1                                                                        SUMMER HOURS (from 17 September 2016)\nMonday to Thursday: 6am-7pm\nFriday: 6am-6pm\nSaturday: 8am-5pm\nSunday: 9am-5pm\nPublic holidays: 12pm-5pm. Closed Christmas Day, Good Friday and Anzac Day.\n\nOutdoor Lagoon Pool\nSaturday and Sunday:  11am-5pm\nSchool Holidays: 11am-5pm daily 
## 2                                                                                                                                                                                  Opening Early for the Summer Season from Monday 5th September\nMonday – Thursday 8am – 11.30am\n3.30pm – 5.30pm\nFriday – 8am – 12pm\nSaturday & Sunday 8am – 6pm
## 3                                                                            SUMMER HOURS (17 September 2016 to 29 March 2017)\nMonday to Friday: 5.30am-7.30am and 11am-6pm\nSaturday: 10am-4pm\nSunday: 10am-4pm\nPublic holidays: 11am-4pm\nClosed Christmas Day and Good Friday\n\nWINTER HOURS\nClosed 30 March to 16 September 2016. Re-opens 17 September 2016.
## 4 NORMAL HOURS \nPool and health club hours (excluding dive pool) \nMonday - Thursday: 5am-8pm\nFriday: 5am-6pm\nSaturday to Sunday: 7am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day and Good Friday.\n\nDive pool (opening hours for public use)\nMonday-Friday: closed\nSaturday: 1-3pm\nSunday and school holidays: 11.30am-1.30pm\nPublic holidays: closed
## 5                                                                                                                                                                        SUMMER HOURS (from 17 September 2016)\nMonday to Thursday: 5am-8pm\nFriday: 5am-7pm\nSaturday and Sunday: 7am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day, Good Friday and Anzac Day.
## 6                                                       This pool is open all year round, except for the Kids Fun Pool pool which is only open during summer (September - April).\n\nVENUE HOURS\nMonday - Thursday: 5:30am-8pm\nFriday: 5:30am-6pm\nSaturday: 7.30am-6pm\nSunday: 8am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day, Good Friday and Anzac Day.
##                                                                                                                                                                                                          Facilities
## 1                                     Aqua aerobics, Disabled Access/Facilities,  Enclosed Program pool, Indoor heated Pool, Outdoor pool, Lifeguards, Open in winter, Squad Swimming, Swimming lessons, Water play
## 2                                            Disabled Access/Facilities, Heated pool, Indoor Pool, Lifeguards, Outdoor pool, Squad Swimming, Stroke Development, Swimming lessons, Wading pool, Water play, Café
## 3                                                                                                                        Aqua-aerobics, Heated pool, Kiosk, Lifeguards, Outdoor pool, Swimming lessons, Wading pool
## 4                                    Aqua aerobics, Diving, Gym, Heated pool, Kiosk, Open in winter, Outdoor pool, Squad Swimming, Stroke Development, Swim Fit, Swimming lessons, Wading pool, café, water polo
## 5 Aqua aerobics, Disabled Access/Facilities, Heated pool, Indoor Pool, Kiosk, Leisure Centre/Water Park, Lifeguards, Open in winter, Outdoor pool, Squad Swimming, Stroke Development, Swimming lessons, Water play
## 6                                                   Aqua aerobics, Disabled Access/Facilities, Heated pool, Indoor Pool, Kiosk, Lifeguards, Open in winter, Outdoor pool, Swimming lessons, Wading pool, Water play
##   Disability_Access                                             Parking
## 1               Yes Free Car parking: 120 spaces; Public transport: Bus
## 2               Yes          Free Car Parking ; Public transport: Buses
## 3                No     Free Car parking; Public transport: Bus & train
## 4               Yes             Car parking available; Public Transport
## 5               Yes           Car park available; Public transport: Bus
## 6               Yes                                  Car park available
##    Latitude Longitude
## 1 -27.58616  153.0264
## 2 -27.56547  152.8911
## 3 -27.60744  152.9315
## 4 -27.45537  153.0251
## 5 -27.38583  153.0351
## 6 -27.45516  153.0789

Reading special file formats such as JSON requires additional packages, such as jsonlite. In this section, we will use this package to retrieve a JSON file from a web API. First install and load jsonlite:

#Install jsonlite
install.packages("jsonlite")
#Load Package
library(jsonlite)

Generally, a web API is a service that receives requests or queries from users and returns a result via a web protocol (mainly HTTP). In this way, users can ask for and use data without needing to know how the data are stored and processed. Thanks to the popularity of JavaScript on the web, JSON has become the most popular file format served by web APIs.

In the following example we pull live station data from the San Francisco bike share scheme:

bikes <- fromJSON(txt = "http://feeds.bayareabikeshare.com/stations/stations.json")

The bikes object is a list; the first element contains the query time:

bikes[1]
## $executionTime
## [1] "2016-12-24 01:11:12 PM"

And the second element contains the data, which we will use to create a new data frame object, “bikes_SF”:

bikes_SF <- data.frame(bikes[2])
head(bikes_SF)
##   stationBeanList.id       stationBeanList.stationName
## 1                  2 San Jose Diridon Caltrain Station
## 2                  3             San Jose Civic Center
## 3                  4            Santa Clara at Almaden
## 4                  5                  Adobe on Almaden
## 5                  6                  San Pedro Square
## 6                  7              Paseo de San Antonio
##   stationBeanList.availableDocks stationBeanList.totalDocks
## 1                             15                         27
## 2                              5                         15
## 3                              9                         11
## 4                             16                         19
## 5                             10                         15
## 6                              5                         15
##   stationBeanList.latitude stationBeanList.longitude
## 1                 37.32973                 -121.9018
## 2                 37.33070                 -121.8890
## 3                 37.33399                 -121.8949
## 4                 37.33141                 -121.8932
## 5                 37.33672                 -121.8941
## 6                 37.33380                 -121.8869
##   stationBeanList.statusValue stationBeanList.statusKey
## 1                  In Service                         1
## 2                  In Service                         1
## 3                  In Service                         1
## 4                  In Service                         1
## 5                  In Service                         1
## 6                  In Service                         1
##   stationBeanList.status stationBeanList.availableBikes
## 1             IN_SERVICE                             12
## 2             IN_SERVICE                             10
## 3             IN_SERVICE                              2
## 4             IN_SERVICE                              3
## 5             IN_SERVICE                              4
## 6             IN_SERVICE                             10
##          stationBeanList.stAddress1 stationBeanList.stAddress2
## 1 San Jose Diridon Caltrain Station                           
## 2             San Jose Civic Center                           
## 3            Santa Clara at Almaden                           
## 4                  Adobe on Almaden                           
## 5                  San Pedro Square                           
## 6              Paseo de San Antonio                           
##   stationBeanList.city stationBeanList.postalCode stationBeanList.location
## 1             San Jose                                     Crandall Street
## 2             San Jose                                 W San Carlos Street
## 3             San Jose                                W Santa Clara Street
## 4             San Jose                                   Almaden Boulevard
## 5             San Jose                                  N San Pedro Street
## 6             San Jose                                Paseo de San Antonio
##   stationBeanList.altitude stationBeanList.testStation
## 1                                                FALSE
## 2                                                FALSE
## 3                                                FALSE
## 4                                                FALSE
## 5                                                FALSE
## 6                                                FALSE
##   stationBeanList.lastCommunicationTime stationBeanList.landMark
## 1                   2016-12-24 13:10:44                 San Jose
## 2                   2016-12-24 13:10:08                 San Jose
## 3                   2016-12-24 13:07:11                 San Jose
## 4                   2016-12-24 13:09:14                 San Jose
## 5                   2016-12-24 13:10:48                 San Jose
## 6                   2016-12-24 13:08:07                 San Jose
##   stationBeanList.is_renting
## 1                       TRUE
## 2                       TRUE
## 3                       TRUE
## 4                       TRUE
## 5                       TRUE
## 6                       TRUE

Querying an API - Twitter as an Example of Social Media Data

Although we used a fixed API endpoint in the last section to pull down a set of live data for a bike share scheme, many APIs can be supplied with queries that flexibly return a subset of live data. In this example we will query the Twitter API using the rtweet package. However, before you do this, you will need to set up a consumer key and secret (you will need a Twitter account to do this).

Next we will install and load the rtweet package:

# Install Package
#install.packages("rtweet")
# Although rtweet on CRAN is functional, the latest version on GitHub has many bug fixes
install.packages("githubinstall")
githubinstall::gh_install_packages("rtweet", ref = "5ef897e", dependencies = TRUE, ask = FALSE)
#Load package
library("rtweet")

We will then create a new “token” environment for this application using the consumer key and consumer secret you just created - this uses the create_token() function:

# Replace xxxx's with the values you copied from twitter
my_tokens <- create_token(app = "rtweet_tokens", #whatever you named your app
    consumer_key = "xxxxxxxxxxxxxxxxx",
    consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

It is also possible to specify multiple tokens, which is helpful as rtweet will use them to automatically mitigate some of the issues with Twitter API rate limits; multiple tokens are combined using the c() function.
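As a minimal sketch of this (the app names and key/secret values below are placeholders for credentials from two separate Twitter apps):

```r
# Create a token for each app, then combine them into a single vector;
# rtweet can then rotate through the tokens when rate limits are hit
token_1 <- create_token(app = "rtweet_tokens_1",
    consumer_key = "xxxxxxxxxxxxxxxxx",
    consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
token_2 <- create_token(app = "rtweet_tokens_2",
    consumer_key = "xxxxxxxxxxxxxxxxx",
    consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
# Combine the tokens with c(); the result can be passed to the token argument
my_tokens <- c(token_1, token_2)
```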

We then want to save the my_tokens object so that we can reuse it later if we create a new R session.

save(my_tokens, file = "./my_tokens_env") # If you close down R, you can load this in a new session using "load()"

There are many things you can do with the Twitter API (full details available at: https://dev.twitter.com/rest/public). For example, you can access details of a Twitter user such as name, followers, friends, status, etc. For more details, have a look at the package documentation.

Let’s try to get some details of a Twitter account; in this case we will choose the official account for the City of Boulder, CO, USA:

# firstly we need to search for the user
BCO <- lookup_users('bouldercolorado', token = my_tokens)

# You can print all these details as follows
BCO
##    user_id            name     screen_name          location
## 1 16666502 City of Boulder bouldercolorado Boulder, Colorado
##                                                                                                                            description
## 1 The City of Boulder municipal government has served the Boulder community since 1871. Follows and retweets do not imply endorsement.
##                      url protected followers_count friends_count
## 1 http://t.co/ZnTy8Nb0qR     FALSE           55146           338
##   listed_count          created_at favourites_count utc_offset
## 1          806 2008-10-09 14:08:38              711     -25200
##                     time_zone geo_enabled verified statuses_count lang
## 1 Mountain Time (US & Canada)       FALSE     TRUE          10869   en
##   description_urls
## 1
# Or, you can also print specific attributes
BCO$followers_count #shows the number of followers
## [1] 55146

We will now extract a list of the followers; the API limit per token for this call is 75,000, so we will add the “n = ‘all’” option, given that the number of followers is below this:

#Gets the Twitter IDs of all the BCO followers
followers_BCO <- get_followers("bouldercolorado", n = 'all', token = my_tokens)

Using the follower list we can query again to return further details about these users. However, we need to be careful not to exceed the Twitter API limits. It is therefore recommended that you run this on a sample, or load the R data object that has been created for you from a query run on 22/12/16.

# This will pull down the user data for the first 18,000 users which is the limit. You can split the queried users into multiple calls to return the whole list, although you will hit API limits. The following code is commented out, but you can load the object created in the next code block or simply use a sample
# followers_BCO_Details <- lookup_users(followers_BCO,token = my_tokens)
# followers_BCO_Details_2 <- lookup_users(followers_BCO[18001:36000,],token = my_tokens)
# followers_BCO_Details_3 <- lookup_users(followers_BCO[36001:46000,],token = my_tokens)
# followers_BCO_Details_4 <- lookup_users(followers_BCO[46001:nrow(followers_BCO),],token = my_tokens)
# Combine all results
# followers_BCO_Details_All <- rbind(followers_BCO_Details,followers_BCO_Details_2,followers_BCO_Details_3,followers_BCO_Details_4)
# save(followers_BCO_Details_All,file="./data/followers_BCO_Details_All.Rdata")

#You might also just sample a smaller number of users
#followers_BCO_Details_Sample <- lookup_users(followers_BCO[sample(1:nrow(followers_BCO),1000),],token = my_tokens)
#Load Follower Details
load("./data/followers_BCO_Details_All.Rdata")

We can now look at the details of the table generated:

head(followers_BCO_Details_All)
##              user_id             name     screen_name location
## 1 811649029289324544     Renay Fraire    fraire_renay     <NA>
## 2 811648962331430912       Jim Turley      JimTurley9     <NA>
## 3 811643485065383936       Rafalnisan     Rafalnisan3     <NA>
## 4 811643278852358144 Gabriel Landeros GabrielLandero7     <NA>
## 5 811643080797405184     Josephpalomo   Josephpalomo5     <NA>
## 6         3927574814        neishaaaa        ne_bebby     <NA>
##                                                                                                description
## 1                                                                                                     <NA>
## 2                                                                                                     <NA>
## 3                                                                                                     <NA>
## 4 Hey lurk my page.\U0001f917\nHave fun.\U0001f609\nDon't be shy to message me im nice\U0001f60b\U0001f601
## 5                                                                                                     <NA>
## 6                                                                           Take some \U0001f61b\U0001f917
##    url protected followers_count friends_count listed_count
## 1 <NA>     FALSE               2            77            0
## 2 <NA>     FALSE               1            30            0
## 3 <NA>     FALSE               0            21            0
## 4 <NA>     FALSE               0            26            0
## 5 <NA>     FALSE               0            21            0
## 6 <NA>     FALSE             509          1836            0
##            created_at favourites_count utc_offset time_zone geo_enabled
## 1 2016-12-21 19:06:23                0         NA      <NA>       FALSE
## 2 2016-12-21 19:06:07                0         NA      <NA>       FALSE
## 3 2016-12-21 18:44:21                0         NA      <NA>       FALSE
## 4 2016-12-21 18:43:32                1         NA      <NA>       FALSE
## 5 2016-12-21 18:42:44                0         NA      <NA>       FALSE
## 6 2015-10-17 18:32:06               31         NA      <NA>       FALSE
##   verified statuses_count lang description_urls
## 1    FALSE              0   en                 
## 2    FALSE              1   en                 
## 3    FALSE              1   en                 
## 4    FALSE              4   en                 
## 5    FALSE              0   en                 
## 6    FALSE            417   en

A useful attribute is “location”, which is a user-specified location for the account. Roughly 54% of the followers have supplied one:

# The is.na() function tests if the location field has a missing value; preceding it with an ! inverts the function (i.e. not), so we test how many are not NA
table(!is.na(followers_BCO_Details_All$location))
## 
## FALSE  TRUE 
## 22636 26935

We will now limit the dataset to just these records:

followers_BCO_Details_GEO <- followers_BCO_Details_All[!is.na(followers_BCO_Details_All$location),]
# Show the top 6 rows
head(followers_BCO_Details_GEO)
##               user_id                 name     screen_name
## 7          2536233210 Masonic Lodge Meeker     LodgeMeeker
## 8  811636439695794176              Valeria         Valezs2
## 11          270730916         Ellene Duffy     elleneduffy
## 16 811624230865408000       BRENDA ARMENTA BRENDAARMENTA15
## 21 811602977811144704               Kelsie      KelsieLion
## 23 811602232915263488             Vy2 LuLu   Vy2BlackHeart
##                   location
## 7  7th and Park Meeker, CO
## 8           Estados Unidos
## 11              Golden, CO
## 16                Eloy, AZ
## 21       West Columbia, SC
## 23             Atlanta, GA
##                                                                                                                                                       description
## 7  Rio Blanco Lodge #80 every 2nd and 4th Thursday of the month 7:00 Park and 7th Meeker Colorado. NOTE: Not all views posted/ followings are those of the Lodge.
## 8                                                                                      Actores y actrices Pop Películas Rock Clásico Fotografía Dance/Electrónica
## 11                                                                                                                                                           <NA>
## 16                                                                                                                                                           <NA>
## 21                                                                                                                                                          ARMY❤
## 23                                                                         Huge otaku and gamer\nAnime/Music lover\nAnime, Manga,  Cosplay ❤\nPsn:Killerluisx666x
##                        url protected followers_count friends_count
## 7  https://t.co/t3WPVPf5Gg     FALSE             661          1660
## 8                     <NA>     FALSE              27           176
## 11                    <NA>     FALSE               1             6
## 16                    <NA>      TRUE               0            93
## 21                    <NA>     FALSE               4            43
## 23                    <NA>     FALSE               4            71
##    listed_count          created_at favourites_count utc_offset time_zone
## 7            11 2014-05-31 01:19:30              672         NA      <NA>
## 8             0 2016-12-21 18:16:21               20         NA      <NA>
## 11            0 2011-03-23 04:30:48                0         NA      <NA>
## 16            0 2016-12-21 17:27:50                3         NA      <NA>
## 21            0 2016-12-21 16:03:23                4         NA      <NA>
## 23            0 2016-12-21 16:00:25                0         NA      <NA>
##    geo_enabled verified statuses_count lang description_urls
## 7        FALSE    FALSE            481   en                 
## 8        FALSE    FALSE              7   es                 
## 11       FALSE    FALSE              2   en                 
## 16       FALSE    FALSE              0   en                 
## 21       FALSE    FALSE              1   en                 
## 23       FALSE    FALSE              0   en

The web as an analytics and mapping platform

Although we covered some aspects of using web-enabled infrastructure to conduct remote queries previously (see 2. Data Manipulation in R), there is an array of ways in which different services can be utilized from within R. Here we will explore the ggmap package, which extends the mapping capabilities of ggplot2.

In the previous section we created a list of Twitter accounts based on followers of the City of Boulder, CO Twitter account, and limited these to those with user-specified locations. If you view these details, it is obvious that they are of variable quality (in terms of being actual places); however, a substantial proportion do relate to geographic locations. First install and load ggmap:

install.packages("ggmap")
library(ggmap)
## Loading required package: ggplot2

We will now write some code that will attempt to geocode the locations. First we will extract a list of locations and their frequency:

# Build a frequency table of locations
Locations <- data.frame(table(followers_BCO_Details_GEO$location))
# Sort in descending order
Locations <- Locations[order(-Locations$Freq),]

The distribution has a very long tail, with many locations appearing only once. The Google geocoding API has a limit of 2,500 calls, so we will first select all those locations with a frequency greater than one, which results in 1,151 records. We will then add a random sample of 1,349 of the locations that appear only once.

# create a sample of locations with a frequency over 1
A <- Locations[Locations$Freq > 1,]
# create a sample of locations with a frequency of 1
B <- Locations[Locations$Freq == 1,]
#Randomly select rows that when added to A will make the total rows 2500
B <- B[sample(1:nrow(B),(2500 - nrow(A))),] 
#Combine the two together and keep just the locations
sample_locations <- as.character(rbind(A,B)[,"Var1"])
#Show the first six locations
head(sample_locations)
## [1] "Boulder, CO"       "Denver, CO"        "Boulder, Colorado"
## [4] "Colorado"          "Colorado, USA"     "United States"

Geocoding is managed very simply with the geocode() function, which accepts a character vector of names to search. The call below has been commented out because the geocoding has already been run; the results are saved in the “U_Locations_Geocode.Rdata” object, which we load instead:

#geocode sample
# U_Locations_Geocode <- geocode(sample_locations,output="latlon",source="google")
# save(U_Locations_Geocode, file = "./data/U_Locations_Geocode.Rdata")
# Load the geocoding results
load("./data/U_Locations_Geocode.Rdata")

Next we need to join the geocoded results to the sample locations - these align because the places within “sample_locations” were geocoded in the order in which they appear in the data frame object. As such, we can use cbind() to “column bind” the two objects together:

# Column bind the two data frame objects
sample_locations_geocoded <- cbind(sample_locations,U_Locations_Geocode)
# Show the first 6 rows
head(sample_locations_geocoded)
##    sample_locations        lon      lat
## 1       Boulder, CO -105.27055 40.01499
## 2        Denver, CO -104.99025 39.73924
## 3 Boulder, Colorado -105.27055 40.01499
## 4          Colorado -105.78207 39.55005
## 5     Colorado, USA -105.78207 39.55005
## 6     United States  -95.71289 37.09024

We can then append the geocoded results back onto the Locations object:

# Append the geocoded locations
Locations_GEO <- merge(Locations, sample_locations_geocoded, by.x="Var1",by.y="sample_locations",all.x = TRUE)
# Remove all the records with no locations
Locations_GEO <- Locations_GEO[!is.na(Locations_GEO$lat),]
# Change the column names
colnames(Locations_GEO) <- c("location","frequency","lon","lat")

We can have a look at these on a map, which shows the main cluster of locations around Boulder, as you might expect.

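The code used to produce this map is not shown; a minimal sketch along the same lines, using ggmap with the Locations_GEO object created above (the centre coordinates and zoom level here are illustrative choices, not the exact values used for the figure), would be:

```r
# Centre the map roughly on Colorado and overlay the geocoded locations,
# sizing each point by how often that location appears in the follower data
ggmap(get_map(c(-105.5, 39.5), zoom = 5, maptype = "roadmap")) +
  geom_point(data = Locations_GEO,
             aes(x = lon, y = lat, size = frequency),
             colour = "red", alpha = 0.5)
```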

Map context and the web

Although we have previously covered various ways in which we can create maps in R, it is often helpful to pull down background maps to help illustrate our cartography. This again relies on APIs; however, these are hidden within the R functions.

We will be using some data from AirBnB concerning the locations of properties that their owners have identified as being within Manhattan, NYC. We will first read in these data.

# Read in data
listings <- read.csv("./data/listings.csv")
# Calculate a price per bed
listings$price_beds <- listings$price / listings$beds
#Show top six rows
head(listings)
##        id latitude longitude property_type       room_type accommodates
## 1 2082223 40.71031 -74.01638     Apartment    Private room            1
## 2 2986941 40.71728 -74.01524     Apartment Entire home/apt            2
## 3 1712688 40.71177 -74.01730     Apartment Entire home/apt            2
## 4  845495 40.70743 -74.01732     Apartment Entire home/apt            4
## 5 2373737 40.70773 -74.01754     Apartment Entire home/apt            4
## 6 1777007 40.71078 -74.01623     Apartment     Shared room            1
##   bathrooms bedrooms beds bed_type price review_scores_rating price_beds
## 1         1        1    1 Real Bed    80                   90         80
## 2         1        1    1 Real Bed   300                  100        300
## 3         1        0    1 Real Bed   400                   NA        400
## 4         1        1    2 Real Bed   250                   97        125
## 5         1        1    1 Real Bed   255                   93        255
## 6        NA        1    1    Couch    43                   94         43

To plot a base map we use the get_map() function, which requires a number of input parameters, including the location, a longitude/latitude pair for the centre of the map. For this, we will take the mean of the property locations to centre the map. The other parameter required is “zoom”, which sets the scale of the map (low number = globe; high number = close to streets). The “maptype” controls the tileset used for the map.

map <- get_map(c(mean(listings$longitude), mean(listings$latitude)), zoom = 13, maptype = "roadmap")
P <- ggmap(map) # Note we have stored the basic map in the new object P
P

Another way in which we can set up a map is by using a keyword rather than a specific lat/lon. For example, the following will give you a map of Singapore.

ggmap(get_map("Singapore", zoom = 12, maptype = "roadmap"))

As shown in the previous tutorial, we can control elements of the plot within ggplot, and the same is true for ggmap. For example, if we want to hide the axes:

# Add a series of options onto the previously created object P
P +  theme(axis.line = element_blank(),
    axis.text = element_blank(),
    axis.title=element_blank(),
    axis.ticks = element_blank(),
    legend.key = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank())

We can now add the listings (points) to the map - each has a latitude and longitude co-ordinate. To begin with we will just show the locations of the points, using the “size” option to adjust the point size.

P + geom_point(data=listings, aes(x=longitude, y=latitude),size=2)

You will see this produces a map; however, it also creates a warning about missing values - don’t worry, this is just telling you that not all the rows of data in the data frame are visible on the map. You could make this warning go away by changing the zoom level - i.e. creating a map with a greater geographic extent.
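For example, a map with a wider extent might be built as follows (a sketch only; the zoom value of 12 is an illustrative choice and may still not capture every listing):

```r
# Rebuild the base map at zoom 12 rather than 13 to cover more of the listings
P_wide <- ggmap(get_map(c(mean(listings$longitude), mean(listings$latitude)),
                        zoom = 12, maptype = "roadmap"))
P_wide + geom_point(data = listings, aes(x = longitude, y = latitude), size = 2)
```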

You can also adjust the color of the points using the “color” option.

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2)

Because the price per bed is a continuous variable, the points are now scaled along a color gradient from the lowest to highest values. However, this doesn’t show you very much, as most of the values are clustered towards the bottom of the range. We can check this by plotting the values as a histogram, where each bar is a $25 bin.

# Plot a histogram
qplot(price_beds, data=listings, geom="histogram",binwidth=25)

There are a number of ways in which we can adjust our map to make it more effective at communicating changes in price. First we will change the color of the scale to one of the ColorBrewer palettes - for this we use the scale_color_gradientn() function.

# Load RColorBrewer
library(RColorBrewer)
#Make plot
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd")) 

Although the color has changed, we still have the issue with values being clustered at the end of the scale. However, there are a number of additional options that we can use to control for this. The first is “limits” which we can use to adjust the minimum and maximum value on the scale. Here we take the range 75-300.

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limits=c(75,300)) 

You may have noticed some grey points on the map - these are the properties with values outside the range specified. We can hide these using a further option, “na.value”, to which you can assign either a color or, as shown in this example, NA, which makes them hidden.

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limits=c(100,500),na.value=NA) 

We could, for example, use this technique to plot just the very expensive properties, which we will define as between $400 and $1,000 per bed.

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds)) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limits=c(400,1000),na.value=NA) 

We can also scale the size of the points. For example, we might want to color the points by the bed type, but scale the points by the price.

First of all we will just map the bed type - note that this variable is a factor, so ggmap (like ggplot) displays it as a categorical value.

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=bed_type))

We can see that most of the AirBnB listings concern real beds, although there are other types across Manhattan. We can extend this plot to explore how these relate to price. Again, we will focus on more expensive properties, between $400 and $1,000. For this we add the “size” parameter to the aes() and, additionally, use a new function, scale_size(), which controls the range of point sizes used (in this case 3 to 10). You will see that there are two very expensive couches that can be rented!

P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=bed_type,size=price_beds)) + scale_size(range = c(3, 10),limits=c(400,1000))

Further resources / training